Abstract

In this paper, feedforward neural networks are presented that have nonlinear weight
functions based on look-up tables, which are smoothed by a special regularization called
diffusion. The idea behind this type of network rests on the hypothesis that a greater
number of adaptive parameters per weight function might reduce the total number of weight
functions needed to solve a given problem. If, in addition, the computational complexity
of propagation through a single such weight function is kept low, the introduced neural
networks might be relatively fast.

A number of tests are performed, showing that the presented neural networks may indeed
perform better in some cases than the classic neural networks and a number of other
learning machines.

keywords: feedforward neural networks, nonlinear regression, nonlinear weight functions,
generalization, diffusion



1 INTRODUCTION 

Introducing adaptive nonlinear processing into the weight functions of a feedforward,
densely connected neural network gives the network on the order of $N^2$ adaptive
nonlinear processing units, where $N$ is the number of nodes. For large $N$, this can be
a substantial difference in comparison to the respective adaptive nonlinear processing
units in the classic neural networks.

The feedforward neural networks presented in this paper have nonlinear weight functions
based on look-up tables (LUTs) - a look-up table holds the nodes of a piecewise linear
interpolation over the arguments of a weight function. Thanks to this, only one or two
nodes in the table need to be read to propagate a signal. As a result, only a small
subset of the parameters is used during the propagation of a given signal, which can keep
the propagation time reasonable despite a very large number of adaptive parameters.
Another quality of the described neural networks is that subsequent parameters in the
look-up table control the propagation of subsequent ranges in the domain of the weight
function. Because of this, it can be hypothesized that two subsequent adaptive parameters
in the look-up table are, with some probability, likely to control close points in the
input space of the neural network. This property can be exploited by an additional
regularization of the weight function during the training process. Such a regularization
is very important in the case of the presented neural networks, because it reduces the
problems with the lack of smoothness of the look-up table based adaptive functions, which
may cause serious generalization problems, as reported by Piazza et al.

The regularization, called diffusion, works as follows. During the training process, the
values of the propagated signals are traced to build 'visit' tables that record how
frequently the look-up table arguments fall into different ranges of values. Each
nonlinear weight function has such an accompanying visit table. On the basis of these
tables, a regularization is performed, which 'diffuses' values within the look-up tables
from the more 'visited' regions to the less 'visited' regions. This way, values of
signals are extrapolated within a weight function, and in effect it might be likely that
the values representing the generalized function are extrapolated by the discussed neural
networks as well.

Schematic examples of generalization in the cases of linear, spline-like and LUT weight
functions are illustrated in Fig. 1. The lack of smoothness of the weight function in (c)
may clearly decrease the generalization quality [Piazza et al.]. Some adaptive parameters
are completely independent of the training samples in that case. The weight function (b)
is smooth and has a relatively small number of adaptable parameters to improve
generalization - networks with similar, spline-like adaptive activation functions have
been presented by Uncini, Capparelli and Piazza [Uncini et al., 1998], Vecci, Piazza and
Uncini [Vecci et al., 1998], and Guarnieri and Piazza [Guarnieri and Piazza]. Such a
lower number of adaptive parameters, however, is exactly what we want to avoid in the
proposed architecture. An example of generalization using a weight function with a large
number of adaptable parameters, like in (c), without and with the diffusion of the
function, is shown in the tests of Sec. 4.2. It can be seen, comparing the respective
figures and generalizing functions, that the diffusion may clearly improve the
generalization.

Figure 1: A schematic example of generalization using (a) linear, (b) smooth and (c)
LUT weight function types. I are arguments and O are values of the weight functions,
and crosses schematically denote points that minimize the training error function.

The use of a high number of adaptable parameters for each connection may raise the
question about bounds on the generalization performance of a learning machine [Vapnik,
1995a]. First, it should be noted that the high number of parameters per connection
might result in a lower total number of connections needed. Secondly, even though the
introduced neural networks might have a very high Vapnik-Chervonenkis (VC) dimension, so
that the risk bound would be very high, the VC dimension takes into account only the
maximum number of training points that can be shattered by the learning machine. Thus it
can be a very 'loose' bound, in the sense that other qualities of the learning machine
can make the actual risk much lower than the bound. For example, the discussed diffusion
can make the subsequent adaptable parameters in the LUT weight function dependent on each
other, yet this is not taken into account at all in computing the VC dimension.

2 NETWORKS WITH DIFFUSED WEIGHT FUNCTIONS



Let us separate the notion of a neuron into a number of connections and a node. We do so
because there are two different types of connections in the discussed neural networks. The
weight function is associated with a given connection, and the combination and activation
functions are associated with a node. This way of describing a neuron is illustrated in
Fig. 2.

Figure 2: Structure of an example neuron.

Let $n$ denote the iteration number. In a connection $i$, the connection output value
$O_i(n)$ is computed from the connection input value $I_i(n)$ by using the connection
weight function.

The combination function $u^c_k(n)$ of a node $k$ sums its arguments, like in the classic
neurons [McCulloch and Pitts, 1943]:

$$u^c_k(n) = \sum_{i \in M} O_i(n), \qquad (1)$$

where $M$ is the set of indexes of the input connections of the node $k$.
The activation function $u^a_k$ of a node $k$ is sigmoid-like:

$$y_k(n) = u^a_k(u^c_k(n)) = \tanh(u^c_k(n)), \qquad (2)$$

where $y_k(n)$ is the output value of the node $k$. The activation function softly clamps
the combination function value, so that the value fits into the domains of the LUT weight
functions.

A linear connection $i$ has a weight function of the form

$$O_i(n) = w^l_i I_i(n), \qquad (3)$$

where $w^l_i$ is the connection scalar weight.

A LUT connection $i$ has the following weight function

$$O_i(n) = w^l_i(n) I_i(n) + r(\mathbf{w}^t_i(n), I_i(n)), \qquad (4)$$

where $w^l_i(n) I_i(n)$ is a component further called the linear one and
$r(\mathbf{w}^t_i(n), I_i(n))$ is a component further called the LUT one. The coefficients
$w^l_i(n)$ and $\mathbf{w}^t_i(n)$ are the parameters of the connection weight function.
The parameter $w^l_i(n)$ is a scalar and the parameter $\mathbf{w}^t_i(n)$ is the LUT of
the weight function. The function $r(\mathbf{w}^t_i(n), I_i(n))$ is determined by a curve
being a linear interpolation of several points $p^i_j(n) = (I^j, w^t_{i,j}(n))$, where
$j = 0 \ldots r_{res} - 1$ and $w^t_{i,j}(n)$ is the $j$th element of $\mathbf{w}^t_i(n)$.
The first coordinate of the points on the curve denotes arguments and the second one
values of the function. The values $I^j$ are equally distributed as follows

$$I^j = I_{min} + \frac{j}{r_{res} - 1} (I_{max} - I_{min}). \qquad (5)$$

Let us call $r_{res}$ the LUT component resolution. The coefficients $I_{min}$ and
$I_{max}$ are the minimum and maximum allowable values of $I_i(n)$, respectively. These
values are equal to the minimum and maximum values of the activation functions. Let the
function $r(\mathbf{w}^t_i(n), I_i(n))$ be computed using the following piecewise linear
interpolation:

$$\begin{aligned}
r(\mathbf{w}^t_i(n), I_i(n)) &= (\lfloor S(n) \rfloor + 1 - S(n)) \, w^t_{i,\lfloor S(n) \rfloor}(n)
  + (S(n) - \lfloor S(n) \rfloor) \, w^t_{i,\lfloor S(n) \rfloor + 1}(n), \\
S(n) &= \frac{I_i(n) - I_{min}}{I_{max} - I_{min}} (r_{res} - 1).
\end{aligned} \qquad (6)$$

Alternatively, of course, another interpolation type, for example cubic spline
interpolation [Greville, 1969], [de Boor, 1978], could be used.
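The weight function of a LUT connection, as defined in (4) to (6), can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the function names `lut_component` and `lut_connection`, the default domain of -1 to 1, and the clamping of the node index at the upper end of the domain are choices made here.

```python
import math


def lut_component(lut, x, i_min=-1.0, i_max=1.0):
    """Piecewise linear interpolation r(w, I) over equally spaced LUT nodes."""
    r_res = len(lut)  # LUT resolution, assumed to be at least 2
    # Position of x in index units: 0 at i_min, r_res - 1 at i_max.
    s = (x - i_min) / (i_max - i_min) * (r_res - 1)
    # Clamp the left node index so that j + 1 stays inside the table at x == i_max
    # (an assumption of this sketch).
    j = min(max(int(math.floor(s)), 0), r_res - 2)
    frac = s - j
    # Only lut[j] and lut[j + 1] are read, whatever the resolution.
    return (1.0 - frac) * lut[j] + frac * lut[j + 1]


def lut_connection(w_lin, lut, x, i_min=-1.0, i_max=1.0):
    """Connection output: linear component plus LUT component."""
    return w_lin * x + lut_component(lut, x, i_min, i_max)


lut = [0.0, 1.0, 0.0]                   # three nodes at -1, 0 and 1
print(lut_component(lut, -0.5))         # 0.5, halfway between nodes 0 and 1
print(lut_connection(2.0, lut, -0.5))   # -0.5 = 2.0 * (-0.5) + 0.5
```

Whatever the table length, a single evaluation reads two table entries, which is what keeps the propagation cost independent of the resolution.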



The memory overhead of the presented networks grows linearly with $r_{res}$ - each
nonlinear weight function needs only one additional table for the diffusion process
discussed in Sec. 3.2.2, and the additional table is of the size $r_{res}$. Thus, even
with networks having thousands of connections, and $r_{res}$ being of the order of
hundreds, the overhead can be low on modern computers, because such computers often
feature hundreds of megabytes of memory.

Because the piecewise linear interpolation requires only one or two adaptive parameters,
the ratio of the number of parameters used within a single propagation to the number of
all parameters is less than or equal to $2/r_{res}$. Let us call the quality of using
only some chosen adaptive parameters the selective parameters property. The property can
make propagation very fast, while still retaining a large number of adaptive parameters
available to the training process. The relatively fast propagation in the presented
networks will be demonstrated in tests. If, instead of the piecewise interpolation, a
function like a single polynomial were used for a weight function, then the property of
selective parameters would not apply, because each adaptive parameter would be needed to
find a value of the function.

The linear component $w^l_i(n) I_i(n)$ has the role of generalizing linear patterns. It
has been found in tests that a neural network with both the linear and the LUT components
may in some cases perform substantially better than a network with only the LUT
components.

The introduced networks are fully connected multilayer feedforward neural networks
[Bishop, 1995], [Hertz et al., 1991]. All linear connections that would be used in a
basic layered feedforward neural network, except for those from bias elements, are
replaced with LUT connections in the presented networks. Bias elements have a constant
output, and therefore there is no need for a LUT connection. A sample of such a NN is
illustrated in Fig. 3.

Figure 3: An example of a fully connected multilayer feedforward NN with nonlinear
weight functions, $L_i$ denotes the $i$th layer.



3 THE LEARNING ALGORITHM 

The presented NNs have their distinct learning algorithm consisting of a modified error
backpropagation [Rumelhart and McClelland], [Bishop, 1995], [Hertz et al., 1991] and a
regularization of the weight functions.

A schematic diagram of a single iteration of the learning algorithm is presented in
Fig. 4. In the beginning of an iteration, attributes of the training samples are
propagated through the network. The propagation needs derivatives of the weight
functions. In the case of the LUT weight function, an approximation of the derivative is
computed instead. During the propagation, weight functions are adapted and the visit
functions are updated. Then, regularization is performed. Within the regularization,
weight decay is performed on both the linear and nonlinear weight functions. Also, the
nonlinear weight functions and the visit functions are diffused, according to the values
in the visit functions.

Note that the order of the mentioned operations is not critical - the training process
may have many iterations, and so the progressive changes within a single training
iteration are usually relatively small. In the detailed description of the learning
algorithm later in this section, for mathematical completeness, the blocks presented in
Fig. 4 are tied in a specific order, but the order is practically unimportant.

Figure 4: Diagram of a single iteration of the learning algorithm.



3.1 TRAINING 



On-line training with error backpropagation [Rumelhart and McClelland], [Bishop, 1995],
[Hertz et al., 1991] is used.

Let a training set be given. Each sample $k$ in the set has $i$ argument attributes and
$j$ value attributes $(x^k_0, x^k_1, \ldots, x^k_{i-1}, d^k_0, d^k_1, \ldots, d^k_{j-1})$,
and we want the network to generalize the relation between the attributes with a function
mapping the argument attributes to the value attributes. Let $e_i$ be an error function
derivative backpropagated to the connection $i$, and $\mu$ be the learning step.

To keep the descriptions of the training algorithm and the regularization algorithm
separate, here the regularization functions $R_d(w)$ and $R_s(w, \Delta w)$ will only be
briefly mentioned, and described in detail in Sec. 3.2. The function $R_d(w)$ is a simple
weight decay [Krogh and Hertz, 1992] - its value is its argument multiplied by a value in
the range $(0, 1)$. The function $R_s(w, \Delta w)$ is a kind of indirect weight decay
that does not have some of the drawbacks of the regular weight decay, but let us now for
simplicity assume that

$$R_s(w, \Delta w) = \Delta w. \qquad (7)$$

The regularization of the LUT component can be much more computationally complex than
the regularization of the linear connection and of the linear component, because of the
number of adaptive parameters of the LUT component. Because of this, while the
regularization of the linear connection and of the linear component is performed in every
training iteration, the regularization of the LUT component is performed only in those
training iterations in which the following is true:

$$\zeta > \zeta_i(n), \qquad (8)$$

where

$$\zeta_i(n) = \mathrm{rand}(0, 1), \qquad (9)$$

and the function $\mathrm{rand}(0, 1)$ is a uniform random number generator which returns
a random value within $(0, 1)$. The lower the $\zeta$ coefficient is below 1, the lower
the mean computational complexity per training iteration, but the quality of the
regularization may be worse. The LUT regularization can accordingly be made 'stronger' to
counterbalance its exclusion in some iterations. The random exclusion (8), instead of a
regular one, is used to rule out possible resonance with the training samples. Condition
(8) will be repeated in some equations in the later sections.
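The random gating of the LUT regularization in (8) and (9) amounts to one comparison per iteration. The following is a minimal sketch; the function name `lut_regularization_due` and the use of Python's `random` module as the uniform generator are assumptions made here for illustration.

```python
import random


def lut_regularization_due(zeta, rng=random):
    """Condition (8): regularize the LUT component only when zeta > rand(0, 1)."""
    return zeta > rng.random()


rng = random.Random(0)
trials = 100_000
hits = sum(lut_regularization_due(0.05, rng) for _ in range(trials))
print(hits / trials)  # close to zeta = 0.05 on average
```

With a coefficient of 0.05 the costly LUT regularization therefore runs in roughly one iteration out of twenty, and with a coefficient of 1 it runs in every iteration.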



3.1.1 ADAPTING WEIGHTS OF THE LINEAR CONNECTIONS 

The weights are randomly initialized before training, with values in the range
$(-0.5, 0.5)$.

Let there be a linear connection $i$ with the input value $I_i(n)$. The weight of the
linear connection is adapted as in the classic propagation, with the weight decay applied:

$$w^l_i(n + 1) = R_d(w^l_i(n) + \Delta w^l_i(n)), \qquad (10)$$

where

$$\Delta w^l_i(n) = -R_s(w^l_i(n), \mu e_i(n) I_i(n)). \qquad (11)$$

3.1.2 ADAPTING WEIGHTS OF THE LUT CONNECTIONS 

A LUT component is randomly initialized before the training of the neural network with
small values in the range $(-0.5, 0.5)$ and a constant derivative.

Let there be a LUT connection $i$. Let the linear component weight $w^l_i(n)$ be adapted
analogously to the weight in the linear connection:

$$w^l_i(n + 1) = R_d(w^l_i(n) + \nu \Delta w^l_i(n)), \qquad (12)$$

where

$$\Delta w^l_i(n) = -R_s(w^l_i(n), \mu e_i(n) I_i(n)). \qquad (13)$$

The coefficient $\nu$ specifies a relation between the adaptation speed of the linear
component and the adaptation speed of the LUT component. Increasing the value may cause
the linear patterns to have a greater impact on the generalizing function.

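Let the LUT component weight $\mathbf{w}^t_i(n)$ also be adapted in a way similar to that
of the linear connection:

$$\begin{aligned}
r(\mathbf{w}^{t**}_i(n + 1), I_i(n)) &= r(\mathbf{w}^t_i(n), I_i(n)) + \Delta w^t_i(n), \\
r(\mathbf{w}^{t*}_i(n + 1), I_i(n)) &=
\begin{cases}
R_d(r(\mathbf{w}^{t**}_i(n + 1), I_i(n))) & \text{if } \zeta > \zeta_i(n), \\
r(\mathbf{w}^{t**}_i(n + 1), I_i(n)) & \text{if } \zeta \le \zeta_i(n),
\end{cases}
\end{aligned} \qquad (14)$$

where

$$\Delta w^t_i(n) = -R_s(r(\mathbf{w}^t_i(n), I_i(n)), \mu e_i(n)). \qquad (15)$$

Because the value of the LUT component is not a product of $I_i(n)$ and of
$r(\mathbf{w}^t_i(n), I_i(n))$, but is the value $r(\mathbf{w}^t_i(n), I_i(n))$ itself,
the term $e_i(n)$ is used in (15) instead of the term $e_i(n) I_i(n)$ as in (11). The
symbol $*$ in (14) and later in this section denotes a value before the diffusion of the
LUT component, which is described later in Sec. 3.2.2; the doubled symbol $**$ marks a
value that additionally precedes the weight decay in (14).

The equation (14) changes a value of the LUT component, but it does not decompose the
change into changes of the individual values in the LUT. Let the following conditions be
given on the adaptation of the LUT. If $I_i(n)$ is equal to the $I^j$ coordinate of an
approximated point $p^i_j(I^j, w^t_{i,j}(n))$, then only the value $w^t_{i,j}(n)$ of that
point is changed. To fulfill (14),

$$w^{t**}_{i,j}(n + 1) = w^t_{i,j}(n) + \Delta w^t_i(n). \qquad (16)$$

Otherwise, $\exists j \in \{0, \ldots, r_{res} - 2\}$, $I_i(n) \in (I^j, I^{j+1})$. Then,
the values $w^t_{i,j}$ and $w^t_{i,j+1}$ are changed, using the amounts
$\Delta w^t_{i,j}(n)$ and $\Delta w^t_{i,j+1}(n)$, respectively,

$$\begin{aligned}
w^{t**}_{i,j}(n + 1) &= w^t_{i,j}(n) + \Delta w^t_{i,j}(n), \\
w^{t**}_{i,j+1}(n + 1) &= w^t_{i,j+1}(n) + \Delta w^t_{i,j+1}(n),
\end{aligned} \qquad (17)$$

such that

$$\frac{\Delta w^t_{i,j+1}(n)}{\Delta w^t_{i,j}(n)} = \frac{I_i(n) - I^j}{I^{j+1} - I_i(n)}. \qquad (18)$$

Therefore, the one of the two modified points, denoted $k$, whose coordinate $I^k$ is
nearer to $I_i(n)$, has its $w^t_{i,k}$ value changed by a greater amount. To fulfill (14)
and (18), $\Delta w^t_{i,j}(n)$ and $\Delta w^t_{i,j+1}(n)$ have the following form

$$\begin{aligned}
\Delta w^t_{i,j}(n) &= \Delta w^t_i(n) \frac{a}{2a^2 - 2a + 1}, & a &= \frac{I^{j+1} - I_i(n)}{I^{j+1} - I^j}, \\
\Delta w^t_{i,j+1}(n) &= \Delta w^t_i(n) \frac{b}{2b^2 - 2b + 1}, & b &= \frac{I_i(n) - I^j}{I^{j+1} - I^j}.
\end{aligned} \qquad (19)$$

Since $a + b = 1$, both denominators are equal to $a^2 + b^2$. As can be seen, the quality
of selective parameters applies also to the adapting of the parameters - only one or two
parameters are modified during training within a single iteration, independently of the
value of $r_{res}$.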
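The way a requested change of the LUT component value is split between the two neighboring table entries can be sketched as follows. This is an illustration under assumptions, not the author's code: the names `interpolate` and `adapt_lut` are hypothetical, and the distances to the two nodes are expressed in units of the node spacing so that they sum to 1, which is how the split is made to satisfy both the requested change (14) and the ratio (18).

```python
import math


def interpolate(lut, x, i_min=-1.0, i_max=1.0):
    """Piecewise linear interpolation over equally spaced LUT nodes."""
    s = (x - i_min) / (i_max - i_min) * (len(lut) - 1)
    j = min(max(int(math.floor(s)), 0), len(lut) - 2)
    return (1.0 - (s - j)) * lut[j] + (s - j) * lut[j + 1]


def adapt_lut(lut, x, delta, i_min=-1.0, i_max=1.0):
    """Change the interpolated LUT value at x by delta, touching at most two entries."""
    s = (x - i_min) / (i_max - i_min) * (len(lut) - 1)
    j = min(max(int(math.floor(s)), 0), len(lut) - 2)
    b = s - j      # distance from node j, in units of the node spacing
    a = 1.0 - b    # distance from node j + 1, in units of the node spacing
    norm = a * a + b * b   # equals 2a^2 - 2a + 1 because a + b = 1
    # The nearer node gets the larger share; exactly at node j (b == 0)
    # only lut[j] changes, which is the single-node case.
    lut[j] += delta * a / norm
    lut[j + 1] += delta * b / norm


lut = [0.0, 0.0, 0.0]
before = interpolate(lut, -0.75)
adapt_lut(lut, -0.75, 1.0)
print(lut)                               # two entries changed, the third untouched
print(interpolate(lut, -0.75) - before)  # approximately 1.0, the requested change
```

In the example the input lies three times closer to the first node than to the second, so the first entry receives three times the change of the second.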

3.1.3 APPROXIMATED DERIVATIVE OF LUT WEIGHT FUNCTION 

For error backpropagation to work, a derivative of a weight function with respect to a
connection input value is needed. In the case of a LUT connection, an approximation of
the derivative will be used instead. Let it be described as follows

$$\frac{\partial O_i(n)}{\partial I_i(n)} \approx c^l_i(n) + c^t_i(n), \qquad (20)$$

where $c^l_i(n)$ and $c^t_i(n)$ are a derivative of the linear component and an
approximated derivative of the LUT component, respectively. Of course,

$$c^l_i(n) = w^l_i(n). \qquad (21)$$

In the case of $c^t_i(n)$, a derivative approximation evaluated as the difference of
neighboring LUT values is not used, because it could be too sensitive to individual weight
changes and could cause numerical instability [Guarnieri et al.]. Instead, the
approximated derivative of a LUT component $r$ of a connection $i$ is given by

$$\begin{aligned}
c^t_i(n) &= \|A\|^{-1} \sum_{a \in A}
  \frac{r(\mathbf{w}^t_i, c(I_i(n) + a)) - r(\mathbf{w}^t_i, c(I_i(n) - a))}
       {c(I_i(n) + a) - c(I_i(n) - a)}, \\
A &= \{a_l, a_m a_l, a_m^2 a_l, \ldots, a_m^J a_l\}, \quad a_m^J a_l \le a_h, \quad a_m^{J+1} a_l > a_h, \\
c(q) &= \begin{cases}
I_{min} & \text{if } q < I_{min}, \\
q & \text{if } I_{min} \le q \le I_{max}, \\
I_{max} & \text{if } q > I_{max}.
\end{cases}
\end{aligned} \qquad (22)$$

Therefore, the approximated derivative is the mean of several difference quotients of the
LUT component. The coefficient $a_l$ is the minimum value of $a$, the coefficient $a_h$
is the largest value that $a$ may take, and $a_m$ determines the number of the quotients.
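The approximated derivative of the LUT component, a mean of difference quotients taken at several offsets around the input, can be sketched as below. The sketch rests on assumptions: the helper names are hypothetical, the offsets are taken to grow geometrically from the minimum offset by the multiplier until they would exceed the maximum offset, and the defaults 0.15, 0.35 and 1.1 are the values listed for the tests in Sec. 4.

```python
import math


def interpolate(lut, x, i_min=-1.0, i_max=1.0):
    """Piecewise linear interpolation over equally spaced LUT nodes."""
    s = (x - i_min) / (i_max - i_min) * (len(lut) - 1)
    j = min(max(int(math.floor(s)), 0), len(lut) - 2)
    return (1.0 - (s - j)) * lut[j] + (s - j) * lut[j + 1]


def clamp(q, i_min, i_max):
    """The function c(q): keep q inside the LUT domain."""
    return max(i_min, min(i_max, q))


def offsets(a_l=0.15, a_h=0.35, a_m=1.1):
    """The set A: a_l, a_m * a_l, a_m^2 * a_l, ... up to a_h (needs a_l > 0, a_m > 1)."""
    out, a = [], a_l
    while a <= a_h:
        out.append(a)
        a *= a_m
    return out


def lut_derivative(lut, x, i_min=-1.0, i_max=1.0):
    """Mean of symmetric difference quotients of the LUT component around x."""
    quotients = []
    for a in offsets():
        hi = clamp(x + a, i_min, i_max)
        lo = clamp(x - a, i_min, i_max)
        rise = interpolate(lut, hi, i_min, i_max) - interpolate(lut, lo, i_min, i_max)
        quotients.append(rise / (hi - lo))
    return sum(quotients) / len(quotients)


print(len(offsets()))                         # number of quotients averaged
print(lut_derivative([-2.0, 0.0, 2.0], 0.3))  # a straight line of slope 2
print(lut_derivative([0.0, 1.0, 0.0], 0.0))   # the peak of a tent is averaged out
```

Because every quotient spans a fairly wide interval, a single changed table entry shifts the estimate only slightly, which is the stated motivation for not using the difference of neighboring LUT values.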

3.2 REGULARIZATION 

Two types of regularization are used in the proposed neural networks - the regularization
of the absolute values of the weight functions and the diffusion of the LUT components.

3.2.1 REGULARIZATION OF ABSOLUTE VALUES OF WEIGHT FUNCTIONS

This type of regularization is a kind of weight decay that tries to prevent the absolute
values of the weight functions from getting too large. Weight decay can improve
generalization [Krogh and Hertz, 1992]. Regularization is used for the adaptable
parameter $w^l_i$ of a linear connection, the linear component adaptable parameter
$w^l_i$ and the LUT component adaptable parameters $\mathbf{w}^t_i$. To regularize
weights, the functions $R_s(w, \Delta w)$ and $R_d(w)$ are used. The functions were
already used in the equations in Sec. 3.1. In this section, the functions will be
described in more detail.
Let the function $R_d(w)$ be as follows

$$R_d(w) = (1 - R_D) w. \qquad (23)$$

As can be seen, it is a simple weight decay, whose strength is determined by the
coefficient $R_D$.

A weight decay like in (23) may have the disadvantage of preventing the training process
from converging exactly into a local minimum - the decay always 'pushes' the weights
toward zero. Yet this type of regularization can still be very important [Krogh and
Hertz, 1992].

In the presented algorithm, the following solution is proposed. The weight decay (23),
should it be required to be too strong, is partially substituted by another type of weight
decay that operates not directly on the values of weights, but instead on the gains of the
weights. Let this other type of weight decay be represented by the function
$R_s(w, \Delta w)$, which has the following equation:

$$R_s(w, \Delta w) = \begin{cases}
\ldots & \text{otherwise}, \\
\Delta w & \text{if } w = 0 \lor R_S = 0,
\end{cases} \qquad (24)$$

[the first branch of (24) is not legible in the source]

where $R_S$ determines the level of the regularization. As can be seen, the more positive
$w$ is, the more the ratio $R_s(w, \Delta w) / \Delta w$ decreases as $\Delta w$
increases, and conversely, the more negative $w$ is, the more the ratio increases as
$\Delta w$ increases. This type of regularization may both slow down the increasing of
the absolute values of the weight functions and speed up the decreasing of the absolute
values.

3.2.2 DIFFUSION OF THE LUT COMPONENT 

A given training sample can, by its very presence, make the point that it represents in
the input space of the regressor, and the surroundings of the point, more statistically
significant or defined. The 'spreading' of the significance over the surroundings is used
in methods like, for example, the nearest neighbor or cubic spline interpolation
[Greville, 1969]. Assuming that the weight functions are not very 'jagged', it can be
hypothesized that two subsequent adaptive parameters in the look-up table are likely to
control close points in the input space of the neural network. The idea behind diffusion
is based on exactly that. A sample has a 'significance', and while its attributes are
propagated through the network, the 'significance' is marked in the respective regions of
the weight functions by means of the accompanying visit function. In the diffusion
process, values in the more 'significant' regions of the weight functions are
heuristically 'diffused' to the less 'significant' regions of the weight function, thus
heuristically performing an interpolation by the 'spreading' of the significance of
samples, like in the mentioned nearest neighbor or cubic spline interpolations.

The algorithm of diffusion was constructed so as to have low memory requirements and low
time complexity. For each LUT component it needs only one additional visit table, of the
size of the LUT of the component, and the time complexity is roughly linear in the
resolution of the LUT. On the other hand, the algorithm is very far from Fick's diffusion
equation, and the 'significance' estimation is heuristic. The algorithm, however, keeps
the time of a single iteration relatively short. This may be important if there are many
training samples, and in effect many iterations are required to propagate the attributes
of the samples.

In the process of diffusion, roughly speaking, the weighted density of occurrence 
of LUT component arguments, at different regions of the LUT component domain, is 
computed. In computing the density, there is a higher importance given to the more 
recent iterations. There are two reasons for giving the more recent iterations a higher 
importance. First, because of the adaptation of weight functions during training, the 
way of propagation of signals may gradually change, and we want the weight functions 
to fit to the more 'current' way of propagation of the signals. Secondly, because we use 
an on-line learning method, there may be possible trends in the training data. 

Values related to the densities are stored in the visit table. For each single value in
the LUT weight function, there is a single corresponding value in the visit table. Values
within the weight function LUT that have relatively higher corresponding values in the
visit table are 'diffused' to those neighboring ones that have relatively lower
corresponding values in the visit table. Let the mean of two such neighboring values of
the weight function LUT before diffusion be $m$. After diffusion, the value that had the
higher corresponding value in the visit table moves less towards $m$ than the other value.
The diffusion incrementally 'spans' the LUT component function over regions that are
modified relatively less frequently or not at all, where the 'spanning' regions are those
modified relatively more frequently. There is no binary division into only extreme
'spanning' and 'spanned' regions, of course, as the visit functions are multivalued.
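The qualitative behavior described above, where the less visited of two neighboring values moves further towards their mean, can be illustrated with a deliberately simplified sketch. It is not the update rule of this paper, which also involves a tanh of the visit values and a smoothing term; the function name `diffuse`, the `speed` parameter and the proportional split by visit counts are assumptions made purely for illustration.

```python
def diffuse(lut, visits, speed=0.1):
    """One simplified diffusion pass over a LUT with at least two entries.

    For each neighboring pair, both values move towards the pair mean m;
    the value with the lower visit count takes the larger share of the move.
    """
    n = len(lut)
    left = [None] * n    # candidate for entry j computed from the pair (j - 1, j)
    right = [None] * n   # candidate for entry j computed from the pair (j, j + 1)
    for j in range(n - 1):
        m = 0.5 * (lut[j] + lut[j + 1])
        total = visits[j] + visits[j + 1]
        # Share of the move taken by entry j: large when its neighbor is visited more.
        share_j = visits[j + 1] / total if total > 0 else 0.5
        right[j] = lut[j] + speed * share_j * (m - lut[j])
        left[j + 1] = lut[j + 1] + speed * (1.0 - share_j) * (m - lut[j + 1])
    out = []
    for j in range(n):
        # Interior entries average their two candidates to keep the pass symmetric;
        # the two border entries have a single candidate each.
        candidates = [c for c in (left[j], right[j]) if c is not None]
        out.append(sum(candidates) / len(candidates))
    return out


print(diffuse([0.0, 1.0], [1.0, 0.0]))  # the visited entry stays, the unvisited one moves
```

Repeating such a pass lets values from frequently visited entries gradually spread into the regions that training never touches.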






The diffusion is also applied to the visit table. This is because if a value 'diffuses' to
a neighboring one, the 'importance' of the value 'diffuses' as well.

During the diffusion process, the LUT functions are also 'smoothed' by decreasing the
absolute differences between neighboring values in the LUTs, to reduce the problems caused
by the lack of smoothness as reported in [Piazza et al.].



Let there be two subsequent values $w^{t*}_{i,j}(n + 1)$ and $w^{t*}_{i,j+1}(n + 1)$ of a
LUT component $r(\mathbf{w}^{t*}_i(n + 1), I_i(n))$, as described in Sec. 3.1. We want to
smooth $r(\mathbf{w}^{t*}_i(n + 1), I_i(n))$ by making the absolute difference
$|w^t_{i,j+1}(n + 1) - w^t_{i,j}(n + 1)|$ smaller than
$|w^{t*}_{i,j+1}(n + 1) - w^{t*}_{i,j}(n + 1)|$. We also possibly want to 'diffuse' each
$w^{t*}_{i,j}(n + 1)$ value to $w^t_{i,j-1}(n + 1)$ and $w^t_{i,j+1}(n + 1)$, depending on
the visit table values. Let a LUT element $w^t_{i,j}(n)$, $j = 0 \ldots r_{res} - 1$, have
its associated visit table element $V_j(n)$. Let $V^*_j(n + 1)$ be the visit function
values before the LUT component regularization. The following equation fulfills the
discussed assumptions:

$$\begin{aligned}
d_j &= w^{t*}_{i,j+1}(n + 1) - w^{t*}_{i,j}(n + 1), \\
m_j &= (w^{t*}_{i,j}(n + 1) + w^{t*}_{i,j+1}(n + 1)) / 2, \\
w^t_{i,0}(n + 1) &= w^{t\prime}_{i,0}(n + 1), \\
w^t_{i,j}(n + 1) &= (w^{t\prime}_{i,j}(n + 1) + w^{t\prime\prime}_{i,j}(n + 1)) / 2, \quad 1 \le j \le r_{res} - 2, \\
w^t_{i,r_{res}-1}(n + 1) &= w^{t\prime\prime}_{i,r_{res}-1}(n + 1).
\end{aligned} \qquad (25)$$

[the lines of (25) that define the two intermediate values $w^{t\prime}_{i,j}(n + 1)$ and
$w^{t\prime\prime}_{i,j}(n + 1)$ from $d_j$, $m_j$ and a $\tanh$ of the visit table values
are not legible in the source]

The two coefficients of (25) determine a smoothing level and a diffusion speed,
respectively. To increase numerical precision, the visit-dependent term is computed
directly from the $V^*_j$ values, which is important because these values may get
extremely low. The computation of the two values $w^{t\prime}_{i,j}(n + 1)$ and
$w^{t\prime\prime}_{i,j}(n + 1)$, and then of their mean, with the exception of the
special cases at the two LUT values having indexes $0$ and $r_{res} - 1$, is performed to
maintain symmetry of the regularization of the LUT.

3.2.3 COMPUTING THE VISIT TABLE



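Let us finally discuss computing the values in the visit table. Let the values
$V^*_j(n)$, that is the values of the visit table before a possible diffusion, be given
by the following equation

$$\begin{aligned}
V^*_j(0) &= V_P, \\
V^*_j(n + 1) &= \begin{cases}
\ldots + R_V (\lfloor S_i(n + 1) \rfloor + 1 - S_i(n + 1)) (1 - V_j(n)) \ldots & \text{if } j = \lfloor S_i(n + 1) \rfloor, \\
\ldots + R_V (S_i(n + 1) - \lfloor S_i(n + 1) \rfloor) (1 - V_j(n)) \ldots & \text{if } j = \lceil S_i(n + 1) \rceil,
\end{cases} \\
l(x) &= \max(x, V_{min}), \\
S_i(n + 1) &= \frac{I_i(n) - I_{min}}{I_{max} - I_{min}} (r_{res} - 1), \\
j &= 0 \ldots r_{res} - 1,
\end{aligned} \qquad (26)$$

[the leading terms of both cases of (26), the placement of the function $l$ and any case
for the remaining $j$ are not legible in the source]

where $S_i(n + 1)$ scales $I_i(n)$ like in (6). The coefficient $V_P$ is the initial
value. The coefficient $V_{min}$ has a very small positive value and is used because of
the limited precision of the representation of real numbers used in computers. The
coefficient $R_V$ determines how large the loss of importance of the less recent
iterations is in computing the values of the visit table.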



The values obtained from (26) are used in diffusing the weight function LUT, as was shown
in (25), yet the visit table is also diffused, for the reasons already discussed in
Sec. 3.2.2. The visit table is diffused like the weight function LUT, but without
smoothing - it was decided to omit the smoothing here, because, in contrast to the weight
function, the visit table does not directly affect the propagation of signals. Thus, the
following formula is used for diffusing the visit table:

$$\begin{aligned}
d_j &= V^*_{j+1}(n + 1) - V^*_j(n + 1), \\
m_j &= (V^*_j(n + 1) + V^*_{j+1}(n + 1)) / 2, \\
V_0(n + 1) &= V'_0(n + 1), \\
V_j(n + 1) &= (V'_j(n + 1) + V''_j(n + 1)) / 2, \quad 1 \le j \le r_{res} - 2, \\
V_{r_{res}-1}(n + 1) &= V''_{r_{res}-1}(n + 1), \\
\forall_{j = 0, 1, \ldots, r_{res} - 1} \; V_j(n + 1) &= V^*_j(n + 1) \quad \text{if } \zeta \le \zeta_i(n).
\end{aligned} \qquad (27)$$

[the line of (27) that defines the intermediate values $V'_j(n + 1)$ and $V''_j(n + 1)$
from $d_j$ and $m_j$ is not legible in the source]

4 TESTS 

In this section, the presented networks will be tested and compared to some other neural
and non-neural learning machines.

Unless otherwise stated, the following coefficients, selected in a number of preliminary
trials, are used in the tests in this section: $\mu = 0.02$, $\nu = 2.5$, $r_{res} = 64$,
$I_{min} = -1$, $I_{max} = 1$, $a_l = 0.15$, $a_h = 0.35$, $a_m = 1.1$, $\zeta = 0.05$,
two regularization coefficients equal to $1 \cdot 10^{-k}$ [names and exponents
illegible], $V_P = 0.1$, $V_{min} = 1 \cdot 10^{-k}$ [exponent illegible], and one further
coefficient [name illegible] equal to $0.001$. For the classic neural networks with
linear weight functions only, to make their weight decay like the classic one described
in [Krogh and Hertz, 1992], $R_s(w, \Delta w)$ is linear because $R_S = 0$, and the weight
decay coefficient $R_D$ is equal to $2 \cdot 10^{-k}$ [exponent illegible]. For the
networks with diffused weight functions, $R_s(w, \Delta w)$ is nonlinear and regularizes
the weight change at $R_S = 1$, but the weight decay is weaker instead - the coefficient
$R_D$ is equal to $1 \cdot 10^{-k}$ [exponent illegible]. To show the regularization
capabilities of the presented networks, only $\mu$, $r_{res}$ and one regularization
coefficient [name illegible] will actually be fitted to different generalized data sets,
except for some special cases, and the rest of the coefficients will be constant.

The 'LW' prefix is used to represent the classic feedforward networks with linear weight
functions only, whereas the 'NLW' prefix corresponds to the introduced networks.
Following this prefix is the number of nodes in the subsequent layers, after the input
layer. Thus, a classic neural network named LW 2-4-4-1 would have 2 nodes in the
input layer, followed by two layers with 4 nodes in each, and finally a single node in the
output layer.

4.1 TIME COMPLEXITY 

The speed of a signal propagation through a connection and the time of modifying a
connection weight can be substantially different for connections with the linear and the
LUT weights. In this section, some time measurements are provided for estimation and
comparison of the learning and propagation performance of both the classic and the
introduced networks. The time results are for a particular implementation and should be
interpreted with care, of course.

Coefficients of the fitted function a + bn:

                          a                               b
LW training               [illegible]                     0.013
LW propagation only       -0.28 [+/- bounds illegible]    0.004
NLW training              [illegible]                     0.056
NLW propagation only      -0.39 [+/- bounds illegible]    0.010

Table 1: Time complexity for a single iteration generalized using linear regression.

A number of architectures were tested, with different numbers of inputs, outputs, layers
and nodes within a layer. The relation of the iteration time to the number of connections
was generalized by linear functions of the form $a + bn$, as shown in Table 1, where $n$
is the number of connections, the time is in milliseconds and the +/- values indicate the
lower and upper parallel bounding lines of the measured times, respectively. The
difference in the propagation-only times is only about 2.5 times (compare the $b$
coefficients 0.010 and 0.004), which results from using a fast piecewise linear
interpolation, giving the selective parameters property, in the nonlinear weight
functions.

The fitted function for learning times against $r_{res}$ is as follows

$$\mathrm{nrl}(r_{res}) = 38.1 + 0.113 \, r_{res}$$

[the +/- bounds of the constant term are not legible in the source]

For an $r_{res}$ increase of 16 times, from 16 to 256, the time per iteration increases
about 1.7 times. This is because of the selective parameters property of the weight
functions, and because $\zeta = 0.05 \ll 1$, so the $r_{res}$ value is critical for time
performance in only about 5% of all training iterations.
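The quoted factor of about 1.7 can be checked directly from the fitted function. The short computation below is only a sanity check; the function name `iteration_time` is hypothetical and the bounds of the constant term are ignored.

```python
def iteration_time(r_res, a=38.1, b=0.113):
    """Fitted learning time as a linear function of the LUT resolution."""
    return a + b * r_res


t16 = iteration_time(16)     # 38.1 + 0.113 * 16
t256 = iteration_time(256)   # 38.1 + 0.113 * 256
print(round(t256 / t16, 2))  # 1.68, that is about 1.7
```

A sixteenfold increase of the resolution thus costs well under a doubling of the learning time, because the constant term dominates the fit.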

4.2 DIFFUSION OF LUT WEIGHTS 

The training set generalized in this section, chosen for visualization purposes, is a
raster image. Each sample of the set has three attributes (x, y, v). The attributes x and y
are the argument attributes and represent coordinates in a two-dimensional mesh, and the
v attribute is the value attribute. The data set 'circle' is shown in Fig. 5(a). The image
has a resolution of 64 x 64. The upper left corner pixel is at (-0.5, -0.5) and the lower
right corner pixel is at (0.5, 0.5). Black pixels in the images denote -0.5 and white
ones 0.5, with a gray scale between the two values. The NN generalization function
will be shown in a similar way, but values less than -0.5 or greater than 0.5 will also
be shown as black or white pixels, respectively. The mask in Fig. 5(b) shows by white
pixels the respective samples that are chosen for the training.
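The mapping between pixels and samples described above might look like the sketch below. The function names and the 8-bit gray levels are assumptions made for illustration; only the corner coordinates, the resolution and the clamping rule come from the text.

```python
SIZE = 64  # image resolution, 64 x 64


def pixel_to_argument(col, row, size=SIZE):
    """Map pixel indices to (x, y): upper left is (-0.5, -0.5), lower right (0.5, 0.5)."""
    step = 1.0 / (size - 1)
    return -0.5 + col * step, -0.5 + row * step


def value_to_gray(v):
    """Map a value to an 8-bit gray level; values outside [-0.5, 0.5] are clamped."""
    clamped = max(-0.5, min(0.5, v))
    return round((clamped + 0.5) * 255)


print(pixel_to_argument(0, 0))                    # upper left corner pixel
print(value_to_gray(-0.7), value_to_gray(0.9))    # both clamped: black and white
```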

Figure 5: The 'circle' data set (a) and (b) its mask.

Let us first test the introduced NNs without the diffusion of the weight functions,
to compare them later with NNs that have the diffusion. To disable the diffusion of the
weight functions, let Rl = 0. Let the LUT smoothing also be disabled, by setting its
coefficient to 0, to show a generalization similar to that seen in Fig^c). Let the C
coefficient be equal to 1 in the examples in this section, as it is not the time efficiency
that is tested here but the generalization ability that is presented, and C = 1 allows for
smoother diagrams of weights. In Fig. 6, images of the generalizing function representing
the approximated training set are shown for a 1-1 NN at some different iterations. The
training set used leaves relatively large regions in the space of the argument attributes
unknown to a trained NN. Because of its generalization ability, the NN should 'extrapolate'
the learned samples over these regions. As can be seen in Fig. 6, the NN without diffusion
generalizes relatively poorly at the 1 • 10^th iteration, showing problems resulting from
the lack of smoothness, as was described in Piazza et al. [199,]. In Fig. 7, diagrams
showing the nonlinear weight functions and the visit tables at some iterations of the
training process are presented. The intensity representing the values in the visit tables
is nonlinearly related to the values, to make the smaller ones more visible. The lack of
diffusion of the LUT weight functions is clearly seen.

Let us then test an identical NN, but with the LUT smoothing coefficient and Rl set to
their standard values of 1 • 10~^. In Fig. 8, a much improved generalization can be seen
after the first 1 • 10^ iterations, in comparison to that shown in Fig. 6. In Fig. 9, the
diffusion of the LUT weight functions can be seen.

Some tests with various values of the diffusion speed coefficient Rl will also be
performed in Sec. 4.4.

Figure 6: Images of the generalizing function after (a) 1000, (b) 10000 and (c) 1 • 10^
iterations, respectively, of a 1-1 NN for the image 'circle', with the LUT smoothing
coefficient equal to 0 and Rl = 0.



Figure 7: A LUT weights diagram of a 1-1 NN for the image 'circle', with the LUT
smoothing coefficient equal to 0 and Rl = 0. Connection x_0 → s(1,0): (a) LUT component,
(b) visit table; connection x_1 → s(1,0): (c) LUT component, (d) visit table; connection
s(1,0) → s(2,0): (e) LUT component, (f) visit table. The i axes denote weight function
LUT (a)(c)(e) or visit table (b)(d)(f) indices, x_j denotes the jth input of the NN and
s(l,n) denotes a node n in the lth layer.

Figure 8: Images of the generalizing function after (a) 1000, (b) 10000 and (c) 1 • 10^
iterations, respectively, of a 1-1 NN for the image 'circle', with the LUT smoothing
coefficient and Rl at their standard values.

4.3 SMALL SIZE DATA SETS

The type of neural networks presented in this paper was designed for the generalization
of sets that are of a very high complexity or highly nonlinear, for example needing
hundreds of thousands of samples to be roughly represented, or that have patterns like the
'two spirals' set Lang and Witbrock [1988], tested in the next section. Many generalized
data sets, however, are much smaller and more linear. Yet, it may still be a difficult task
to generalize them well - in Draghici [2001] a performance comparison of several neural
and non-neural learning algorithms shows substantial variance in the percentage of
properly classified test samples in the case of some data sets from the UCI repository
Blake and Merz [1998], all of which have less than one thousand samples. It can be
important for a learning machine to perform well on a wide range of data sets, so in this
section a performance comparison is made on some relatively small data sets. In this test,
classification results of classic neural networks with linear weight functions, several
other neural and non-neural learning algorithms, and the introduced neural networks are
compared. The generalized sets are from the mentioned UCI repository and generally have
patterns of a moderate nonlinearity.

The diffusion speed coefficient Rl was set to 0.02 to increase the diffusion rate, so as
to compensate for the relatively sparse samples in the generalized sets.

The generalized sets were randomly divided into training sets containing 80% of the
samples and test sets containing 20% of the samples. 10 runs of each tested architecture
of the classic LW networks, and of the introduced NLW networks, were performed, and the
average values are shown. For comparison, SVM machines Vapnik [1995a,b] of the type
ν-SVM Schoelkopf et al. [2000, 2001] with a radial basis function kernel were also tested.
Because of their relatively high training speed in the case of the small data sets, the
SVMs were run with ten-fold crossvalidation of the c and γ coefficients. To test the SVMs,
the LIBSVM package Chang and Lin [2001] was used. The classification results, averaged
over the tested sets, for some other learning machines: C4.5 using classification rules
Quinlan [1993], incremental decision tree induction ITI Utgoff [lip^flT], Utgoff and
Precup, linear machine decision tree LMDT Utgoff and Brodley [1991], learning vector
quantization LVQ Kohonen [1988, 199f)], induction of oblique trees OC1 Heath et al.,
Nevada backpropagation NEVP based on Quickprop Fahlman [1988], k-nearest neighbors
with k = 5 K5, Q* and radial basis functions RBF Poggio and Girosi [1990], Musavi et al.
[1992], were computed using results from tests in Draghici [2001]. A very good comparison
of the algorithms can be found in Eklund [2000].
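The evaluation protocol for the LW and NLW networks can be sketched as below. This is a sketch under assumptions, not the author's code: the function names are hypothetical, `train_and_score` stands for any routine that trains a freshly initialized network and returns the test accuracy in percent, and a single 80%/20% division shared by all 10 runs is assumed.

```python
import random


def split_80_20(samples, rng):
    """Randomly divide samples into a training set (80%) and a test set (20%)."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def average_accuracy(samples, train_and_score, runs=10, seed=0):
    """Average the test accuracy of several runs on one fixed division."""
    rng = random.Random(seed)
    train, test = split_80_20(samples, rng)
    scores = [train_and_score(train, test, rng) for _ in range(runs)]
    return sum(scores) / runs


# Hypothetical stand-in for a real training routine: always reports 50%.
print(average_accuracy(list(range(100)), lambda train, test, rng: 50.0))
```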

The training limit for the NNs was 10000 iterations. The tested LW networks had
considerably more connections, to make a single training iteration of theirs similarly
fast to that of the tested NLW networks. In Table 2, results are shown for the classic
feedforward networks, the diffused feedforward networks and for ν-SVM with ten-fold
crossvalidation of the c and γ coefficients. The letter X symbolizes the number of inputs,
equal to the number of argument attributes in the samples of a given set. The networks
LW X-16-16-1 and NLW X-8-8-1 had similar time complexity, but the latter performed
better on average.

Figure 9: A LUT weights diagram of a 1-1 NN for the image 'circle', with the LUT
smoothing coefficient and Rl both equal to the standard value of 1 • 10~^. Connection
x_0 → s(1,0): (a) LUT component, (b) visit table; connection x_1 → s(1,0): (c) LUT
component, (d) visit table; connection s(1,0) → s(2,0): (e) LUT component, (f) visit
table. The i axes denote weight function LUT (a)(c)(e) or visit table (b)(d)(f) indices,
x_j denotes the jth input of the NN and s(l,n) denotes a node n in the lth layer.

Table 3 shows the average results, reported in Draghici [2001], for various other
learning machines on the same data sets, except for the RBF neural networks, for which
the result for the 'Zoo' set is missing. Detailed results for individual data sets can be
found in Draghici [2001]. The comparisons should be interpreted cautiously - the division
into the training and test sets may be different, giving various classification results -
within the learning machines LW, NLW and SVM tested by the author it was the same,
though. It can be seen that SVM had the best average results, and NLW was the second in
all of the tests. It can also be seen that the results of the NLW networks are relatively
similar for very different numbers of connections and two different r_{res} values.



Data set      LW          LW          LW          ν-SVM       NLW         NLW         NLW
              X-8-8-1     X-16-16-1   X-32-32-1               X-4-4-1     X-8-8-1     X-16-16-1
                                                              r_{res} = 16            r_{res} = 64
Glass         67.68       66.74       70.93       74.42       76.74       76.05       80.00
Ionosphere    94.71       95.00       95.43       92.86       91.86       93.29       85.89
Wine          98.61       95.28       97.22       97.22       96.39       94.72       95.28
Pima          77.53       77.53       76.56       77.27       78.12       75.91       75.07
Bupa          60.72       59.86       63.63       69.57       60.87       62.03       66.09
Tic tac toe   96.93       96.93       97.04       100.00      96.78       96.20       97.04
Balance       90.64       91.44       90.72       100.00      96.16       96.48       95.68
Iris          96.00       95.33       95.33       96.67       94.67       95.33       94.66
Zoo           86.00       90.50       88.00       85.00       86.50       88.50       87.50
Average       85.42       85.40       86.10       88.11       86.45       86.50       86.36

Table 2: Comparison of classification results of small size data sets for several LW and
NLW networks and ν-SVM, in percent.
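The 'Average' row of Table 2 is the plain mean over the nine data sets, which can be confirmed for a few columns as in the sketch below. The dictionary layout and names are only illustrative; the numbers are copied from the table.

```python
# Per-data-set results in the row order of Table 2 (Glass ... Zoo), in percent.
TABLE_2 = {
    "LW X-16-16-1": [66.74, 95.00, 95.28, 77.53, 59.86, 96.93, 91.44, 95.33, 90.50],
    "nu-SVM": [74.42, 92.86, 97.22, 77.27, 69.57, 100.00, 100.00, 96.67, 85.00],
    "NLW X-8-8-1": [76.05, 93.29, 94.72, 75.91, 62.03, 96.20, 96.48, 95.33, 88.50],
}

# Values from the 'Average' row of the table.
REPORTED_AVERAGE = {"LW X-16-16-1": 85.40, "nu-SVM": 88.11, "NLW X-8-8-1": 86.50}


def column_mean(values):
    """Arithmetic mean of one column of the table."""
    return sum(values) / len(values)


for name, values in TABLE_2.items():
    print(f"{name}: computed {column_mean(values):.2f}, reported {REPORTED_AVERAGE[name]:.2f}")
```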





          C4.5    C4.5r   ITI     LMDT    CN2     LVQ     OC1     NEVP    K5      Q*      RBF     CBD
Average   79.89   82.71   82.25   84.76   83.52   76.97   75.61   82.30   76.68   74.52   75.29   81.94

Table 3: Comparison of average classification results of small size data sets, the same as
in Table 2, in percent, for several neural and non-neural learning machines.



4.4 TWO SPIRALS 

Because the data sets tested in this section are relatively sparse for the introduced
networks, the resolution of the LUT tables was decreased to r_{res} = 16 and, as was
done in the tests in Sec. 4.3, the diffusion was 'strengthened' by using relatively large
Rl values. The NLW networks are compared to the classic ones and to some other learning
machines.

Let us first test the generalization of the 'two spirals' set, one of the standard
benchmarks for learning machines Lang and Witbrock [1988]. This set, after centering
around the point (0,0), is shown in Fig. 10(a). Each sample in the set has three attributes
(x_0, x_1, y), the first two being the argument attributes and the last one the value
attribute. Even though the spirals may be regarded as relatively well defined because of
the density of the samples determining them, this set is known to be a very hard two-class
problem to learn for classic feedforward neural networks using the error backpropagation
family of learning algorithms. In Lang and Witbrock [1988] it was reported that the
task could not be solved with the tested classic feedforward networks with connections
only between neighboring layers, so to classify each sample in the training set, a special
architecture was developed, where each node was connected to all nodes in the subsequent
layers, and the network was trained using error backpropagation with momentum.
Several other attempts have been made to improve the learning algorithms so as to train
feedforward neural networks more efficiently. For example, in Fahlman and Lebiere [1990]
a learning algorithm is developed that grows the trained neural network by adding new
trained units to the network. The algorithm was successfully applied to the two spirals
problem, but even though the trained neural network learned to classify all samples in
the training set, its generalization quality was relatively poor - the decision border was
very rough and it even crossed the arms of the spirals in some places. Images of the
generalization function of the network can also be found in Fahlman and Lebiere [1990].
The radial basis function neural networks Moody and Darken [1989], even with advanced
techniques like dynamic decay adjustment Berthold and Diamond [1995], show the problem
of the lack of a 'long range' generalization - as can be seen in the images in
Berthold and Diamond [1995], the samples are classified correctly, but the regions far
from the samples seem to have little or nothing in common with the positive or negative
values of the individual samples. The neural networks presented in Perwass et al.
[200.'^], which have neurons whose decision borders are hyperspheres, have the problem
with the 'long range' generalization as well, as can be seen in the images in
Perwass et al. [200.'^].

Figure 10: The (a) 'two spirals' and (b) 'two spirals sparse' data sets.



Let the generalization functions of the networks be sampled and presented as
two-dimensional gray scale raster images of the size 64 x 64, in the same manner as was
done in Sec. 4.2. In Fig. 11, generalization functions are shown for NLW networks trained
at two different values of the diffusion speed coefficient Rl. Fig. 12 shows classification
results for these networks and for a ν-SVM Schoelkopf et al. [2000, 2001]. The results
for the tested LW networks are not shown, as they were not able to even classify the
training set within 1 • 10^ iterations. The SVM performed very well on the set. The
NLW network with the diffusion speed coefficient Rl = 1 ■ 10~^ generalized the training
set with a somewhat rough decision border. Increasing the diffusion speed, by increasing
the value of the diffusion speed coefficient Rl to 0.01, made the generalization much
better. In Solazzi and Uncini [2000] a neural network is introduced with adaptive
multidimensional spline activation functions, and it is shown that the network generalizes
the two spirals set excellently, with results similar to these in Fig. 12(a) and (c), if
the adaptive spline activation function is two-dimensional. Yet the 'two spirals' set has
a generalization function that is also two-dimensional. If the adaptive activation
functions have a single dimension, like in the networks presented by Uncini, Capparelli
and Piazza Uncini et al. [1998], Vecci, Piazza and Uncini Vecci et al. and Guarnieri and
Piazza Guarnieri and Piazza, so that the 'scale up' of the dimension by the high dimension
regressor is non-zero for the 'two spirals' set, the images demonstrated in
Solazzi and Uncini [2000] show substantial artifacts, much larger than these in
Fig. 12(b).



Figure 11: Images of the generalizing function for the 'two spirals' set.

Figure 12: Classification of the 'two spirals' set by (a) SVM with radial basis kernel
function, c = 100, γ = 150, and NLW 2-32-32-1 at the 1 • 10^th iteration, at (b) Rl =
1 • 10", (c) Rl = 0.01.



Let us now discuss another training set, derived from the previous one. Let the
samples within each of the spiral arms be counted from the inner beginning of each arm.
The set is created by removing each odd sample in one of the spiral arms and each even
sample in the other arm. Let the set be called 'two spirals sparse'. This set is shown
in Fig. 10(b). Such a way of removing the samples was used to obtain a special type
of patterns in the set. The patterns create two families, 'along arms' and 'radial', as
illustrated in Fig. 13. It can be said that in the inner side of the spirals the 'along
arms' pattern is stronger than the 'radial' one, because of the relationship between the
appropriate distances between samples. Conversely, the 'radial' pattern is stronger in the
outer regions of the spiral arms. The third type of pattern is created by the outer parts
of the spirals. Because the halves are not covered from the outside by any samples, the
value attributes of the samples creating the halves may possibly be extrapolated to the
outside, so that outside the spirals the generalizing function may roughly have values
greater than 0 for x_1 > 0 and less than 0 for x_1 < 0. Let the task be to generalize the
discussed set so that the strengths of the two families of patterns and of the third
discussed pattern would be appropriately reflected in the generalizing function. The
gradual transition of patterns in the discussed set will allow for testing the evenness of
generalizing different regions of the space of argument attributes of the samples.

Figure 13: Two families of patterns in the 'two spirals sparse' data set, denoted with
different types of lines.
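The construction of the 'two spirals sparse' set can be sketched as follows. Several things here are assumptions rather than facts from the text: the spiral generator follows the commonly used form of the two spirals benchmark (97 points per arm), the value attributes are taken as 0.5 and -0.5, the function names are hypothetical, and the samples are counted from 1, so that 'odd' means the 1st, 3rd, ... sample from the inner beginning of an arm.

```python
import math


def two_spirals(points_per_arm=97):
    """Return two arms of (x0, x1, y) samples, each ordered from its inner beginning."""
    arm_a, arm_b = [], []
    for i in range(points_per_arm):
        angle = i * math.pi / 16.0
        radius = 6.5 * (104 - i) / 104.0
        x0, x1 = radius * math.sin(angle), radius * math.cos(angle)
        arm_a.append((x0, x1, 0.5))
        arm_b.append((-x0, -x1, -0.5))  # the second arm is the point reflection
    # The generator runs from the outside inwards, so reverse both arms.
    return arm_a[::-1], arm_b[::-1]


def make_sparse(arm_a, arm_b):
    """Remove each odd sample of one arm and each even sample of the other (1-based)."""
    kept_a = [s for k, s in enumerate(arm_a, start=1) if k % 2 == 0]
    kept_b = [s for k, s in enumerate(arm_b, start=1) if k % 2 == 1]
    return kept_a, kept_b


arm_a, arm_b = two_spirals()
sparse_a, sparse_b = make_sparse(arm_a, arm_b)
print(len(sparse_a), len(sparse_b))  # prints 48 49
```

Removing alternating samples in opposite phase on the two arms is what makes neighboring samples across arms line up radially, giving the 'radial' family of patterns next to the 'along arms' one.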

In Fig. 14, generalization functions at some different iterations for NLW networks
trained at two different values of the diffusion speed Rl are shown. Fig. 15 shows the
classification results for ν-SVM and NLW networks at two different Rl values. The
tested LW networks did not classify the training set within 1 • 10^ iterations. The NLW
network with the diffusion speed coefficient Rl = 1 ■ 10~^ generalized the training set,
but with rather severe problems. After increasing the value of the diffusion speed
coefficient to Rl = 0.01, the generalization was much better - all three discussed types
of patterns are seen. A further increase of the diffusion speed coefficient Rl to 0.1 did
not give any improvement - as can be seen in the generalizing functions in Fig. 14, the
learning process slowed down and, at the 1 • 10^th iteration, the generalizing functions
seemed to be relatively less 'even'. The SVM with the settings from the previous test with
the 'two spirals' set substantially underfitted the 'two spirals sparse' set, so it was
tested with various different γ values, yet none of the tested SVMs showed results as fine
as the NLW networks - the SVMs tended to give asymmetric decision borders and to neglect
the 'radial' pattern, and the one class SVM Schoelkopf et al. [2001] gave particularly
spurious results.

Figure 14: Images of the generalizing function for the 'two spirals sparse' set.

Figure 15: Classification of the 'two spirals sparse' set by (a) ν-SVM with radial basis
kernel, c = 1, γ = 500, (b) ν-SVM with radial basis kernel, c = 1, γ = 10000, (c) one
class SVM with radial basis kernel, c = 100, γ = 200, and NLW 2-32-32-1 at the 1 • 10^th
iteration, at (d) Rl = 1 • 10"^, (e) Rl = 0.01, (f) Rl = 0.1.



4.5 LEARNING A LARGE SIZE SET 



In this section, the 'storage capacity' of a trained neural network against time is tested,
that is, the ability to memorize the features in a set, and the speed of the memorizing.
Generalization of sets like 'two spirals' Lang and Witbrock [1988] shows that even large,
and thus slow, LW networks might have serious problems with just the memorizing and the
speed of the memorizing. The training set tested in this section is relatively complex and,
to test the high dimension regressor, the samples have five input attributes each. Standard
values of the coefficients were used for training the neural networks with this complicated
set.

Because the author could not find a standard benchmark of the required complexity
and number of samples, a custom data set 'md-2' was used. Let the generating function of
the set be as follows:

y (xo,Xi,X2,X3,X4) 



sin(43;o) cos(23;i + 3x2 

^ siii(10x2+10x3)+l ^ 1/2 



^|| sill(lUX2 + lUX3) + l j"/- _ 

sin(a;3 — Axix^) 

^^ sin(10xo-10x2+10x3)+l _ 

003(5X1X2X4). 



(28) 



Using this equation, tuples (x_0, x_1, x_2, x_3, x_4, y) were generated, where x_i were
random values

x_i = rand() - 0.5,   i = 0, 1, ..., 4,   (29)

where rand() is a uniform random number generator, generating values in the range
(0,1). 1.8 • 10*5 such tuples were used in the training set, and 2 • 10^ independently
generated tuples were used in the test set. The data set intentionally does not contain
noise and is quite densely represented, to test the discussed 'storage capacity'.
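The sampling scheme of equation (29) can be sketched as below. The function names are hypothetical and, importantly, `placeholder_function` is not the generating function of equation (28); it is a stand-in that only reuses two factors appearing there, so that the sketch can run. Note also that Python's `random.random()` returns values in [0, 1), a slightly different range than the (0,1) stated in the text.

```python
import math
import random


def placeholder_function(x0, x1, x2, x3, x4):
    """Stand-in for the generating function; NOT the function of equation (28)."""
    return math.sin(4 * x0) * math.cos(5 * x1 * x2 * x4)


def make_tuples(count, generating_function, seed=0):
    """Generate noise-free tuples (x0, ..., x4, y) with x_i = rand() - 0.5."""
    rng = random.Random(seed)
    tuples = []
    for _ in range(count):
        x = [rng.random() - 0.5 for _ in range(5)]
        tuples.append((*x, generating_function(*x)))
    return tuples


# Independent seeds give independently generated training and test tuples.
training_set = make_tuples(1000, placeholder_function, seed=1)
test_set = make_tuples(100, placeholder_function, seed=2)
print(len(training_set), len(test_set))  # prints 1000 100
```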

Fig. 16 shows a diagram of the mean square error, denoted by MSE, against estimated
times for several LW and NLW networks trained with the set 'md-2'. It is seen that the
tested NLW networks reach an MSE even about ten times lower after similar training times
in comparison to the tested LW networks.

Figure 16: MSE diagram for different neural networks trained with the 'md-2' set, against
estimated time.

5 CONCLUSIONS 

The neural networks with diffused weight functions and with the property of selective
parameters of the functions showed good performance over a wide range of tested data
sets. In particular, they performed very well, in comparison to the classic neural networks
and to the tested SVMs, in the case of the subtle patterns in the 'two spirals sparse' and
'camomiles-m' sets.
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